Conversation
binaryaaron
left a comment
claude was a liar about unsloth and aarch64. also, pull in this change and test again
Force-pushed from 59d13ec to 83fff62
binaryaaron
left a comment
if we want to support dgx spark quickly, the different install path in the container is fine but it's not proper support. let's make a follow up for this - we'll need to do a proper spike on cross-platform support + cuda version support.
```diff
  "vllm==0.15.0",
- "xformers==v0.0.33.post2; sys_platform == 'linux'",
+ "xformers==v0.0.33.post2; sys_platform == 'linux' and platform_machine == 'x86_64'",
]
```
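As a sketch of how these PEP 508 markers resolve (assuming the `packaging` library is available; the environments below are illustrative, not taken from CI):

```python
from packaging.markers import Marker

# The marker from the diff above, evaluated against explicit environments
# rather than the host interpreter.
marker = Marker("sys_platform == 'linux' and platform_machine == 'x86_64'")

x86_linux = {"sys_platform": "linux", "platform_machine": "x86_64"}
spark = {"sys_platform": "linux", "platform_machine": "aarch64"}  # DGX Spark

print(marker.evaluate(x86_linux))  # True  -> xformers gets installed
print(marker.evaluate(spark))      # False -> xformers is skipped
```

On aarch64 the marker is simply false, so resolvers skip the pin instead of failing to find a wheel.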
this is probably going to be a whole thing but we should make a new dep group or do better platform resolution for cross-platform deps.
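One possible shape for that follow-up, sketched as a hypothetical `pyproject.toml` fragment (the group names and exact package split are placeholders, not a decided layout):

```toml
[project.optional-dependencies]
# x86_64-only accelerators; installed via `pip install .[cuda-x86]`
cuda-x86 = [
    "xformers==v0.0.33.post2; sys_platform == 'linux' and platform_machine == 'x86_64'",
]
# aarch64 (DGX Spark) deps that are not always resolved transitively there
cuda-aarch64 = [
    "onnxruntime; platform_machine == 'aarch64'",
]
```

Separate groups keep the lockfile resolvable per platform instead of scattering markers through one shared list.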
Agree this needs a proper spike. For now the platform markers are minimal and don't break x86_64, right?
hmmm do we want to keep the sys_platform markers? locking is borked in CI right now and I've been fighting it by putting linux markers in
We really will need a spark and/or station as part of our github runners so we make sure this continues to work.
Let me know what to change here.
binaryaaron
left a comment
approving conditionally so i don't block while i'm out
We've also got to publish the container somewhere in the future, same with the cuda one.
Force-pushed from a113a39 to 655ca6c
Force-pushed from 8408cca to 819bdb8
- containers/Dockerfile.spark: container-based install using nvcr.io/nvidia/vllm:26.02-py3
- docs/DGX_SPARK.md: quick start guide (build + run in 2 steps)
- pyproject.toml: platform markers for aarch64-incompatible packages (faiss-gpu-cu12, torchvision+cu128, torchao, xformers)
- config/training.py: auto-fallback Flash Attention 3 to sdpa on aarch64
- vllm_backend.py: handle vllm versions without attention_config kwarg

Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
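For the vllm_backend.py change listed earlier (handling vllm versions without the attention_config kwarg), one common pattern is to probe the constructor's signature before forwarding the kwarg. A sketch only; the real backend's handling may differ, and the two `*_llm_init` functions below are stand-ins:

```python
import inspect


def accepts_kwarg(callable_obj, name: str) -> bool:
    """Return True if the callable's signature declares the given keyword
    (or takes **kwargs)."""
    try:
        params = inspect.signature(callable_obj).parameters
    except (TypeError, ValueError):
        return False  # builtins without introspectable signatures
    return name in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


# Hypothetical stand-ins for old/new vllm LLM constructors:
def old_llm_init(model): ...
def new_llm_init(model, attention_config=None): ...

print(accepts_kwarg(new_llm_init, "attention_config"))  # True
print(accepts_kwarg(old_llm_init, "attention_config"))  # False
```

Probing the signature avoids try/except around the constructor call itself, which would mask unrelated TypeErrors.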
onnxruntime is required by gliner but not always resolved transitively on aarch64. opt_einsum is directly imported in dp_transformers/linear.py but was never declared.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
Force-pushed from f3b9988 to c7f1cf7
…lockfile

Signed-off-by: mvansegbroeck <mvansegbroeck@gmail.com>
```dockerfile
# 2. Torch-dependent packages — --no-deps preserves the container's torch/CUDA
RUN pip install --no-deps \
    peft accelerate bitsandbytes datasets==4.3.0 trl==0.26.1 \
```
🤔 should we add an extras group for them? mostly to not have these versions be different than what's in the lockfile
--no-deps install is intentional to preserve the container's torch/CUDA
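To catch the drift the review worries about (container pins diverging from the lockfile), a small check could compare installed versions against expected pins. This is a sketch using only the standard library, not something in the PR:

```python
from importlib.metadata import PackageNotFoundError, version


def find_pin_mismatches(pins: dict[str, str]) -> list[str]:
    """Return human-readable mismatches between installed and expected versions."""
    problems = []
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {expected})")
            continue
        if installed != expected:
            problems.append(f"{name}: installed {installed}, lockfile {expected}")
    return problems


# Example with a deliberately missing package name:
print(find_pin_mismatches({"definitely-not-a-real-package": "1.0.0"}))
```

Run inside the built container (with pins parsed from the lockfile), a non-empty result would fail CI before the versions silently diverge.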
Force-pushed from 94bf848 to 2688050
- Move DGX_SPARK.md to docs/developer-guide/ and add to mkdocs nav
- Make DGX Spark doc description Spark-specific
- Remove SSH clone hint (repo is public)
- Update docs link to github pages URL
- Combine nested if into single condition in training.py
Summary
Add DGX Spark (aarch64) support for Safe Synthesizer:
- `containers/Dockerfile.cuda-aarch64` for building on aarch64/Spark
- `docs/DGX_SPARK.md`: quickstart guide for running NSS on DGX Spark
- `onnxruntime` and `opt_einsum` added to the `cu128` extra
- `pragma: allowlist secret` for the API key placeholder in docs

Pre-Review Checklist
- `make format && make check` or via prek validation
- `make test` passes locally
- `make test-e2e` passes locally
- `make test-ci-container` passes locally (recommended)
- `/sync` on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist